Dataset | About 5.6 Million Reviews of 300,000 Podcasts (2005-2023)

Original | Author: 大邓 | Account: 大邓和他的Python

2024-09-10

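Tip: once a WeChat official-account post has been pushed, it can be edited only once, and by at most 20 characters. If something is wrong or needs updating, the only option is to push the post again. For a better reading experience, read the blog version of this article at: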

https://textdata.cn/blog/2024-06-03-podcasts-dataset/

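1. Dataset overview

Media: Podcast
Data source: https://podcasts.apple.com/
Date range covered: 2005-12-10 ~ 2023-03-07
Number of podcast ids: 303911
Number of reviews: 5607021
Fields include: podcast_id, title, content, rating, author_id, created_at, category, and others

The dataset is large and its fields are rich, which makes it suitable for research in sociology, journalism and communication, linguistics, economics, management, and related fields.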

2. Loading the data

Read each file with pandas.read_json().

2.1 podcasts.json

import pandas as pd

pdf = pd.read_json('podcasts.json', lines=True)

# Inspect the columns of podcasts.json
print(pdf.columns)
pdf

Run

Index(['podcast_id', 'itunes_id', 'slug', 'itunes_url', 'title', 'author',
       'description', 'average_rating', 'ratings_count', 'scraped_at'],
      dtype='object')

2.2 categories.json

cdf = pd.read_json('categories.json', lines=True)

# Inspect the columns of categories.json
print(cdf.columns)
cdf

Run

Index(['podcast_id', 'itunes_id', 'category'], dtype='object')
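
Because podcasts.json and categories.json share the podcast_id column, the two tables can be joined. The following is a minimal sketch on made-up rows, not the author's code; the sample ids, titles, and category values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for podcasts.json and categories.json (all values are made up).
pdf = pd.DataFrame({
    "podcast_id": ["a1", "b2", "c3"],
    "title": ["China Arts Podcast", "Made in China", "Some Other Show"],
})
cdf = pd.DataFrame({
    "podcast_id": ["a1", "a1", "b2"],
    "category": ["arts", "society-culture", "business"],
})

# A podcast may carry several categories, so this is a one-to-many join.
# how="left" keeps podcasts without any category row (category becomes NaN).
merged = pdf.merge(cdf[["podcast_id", "category"]], on="podcast_id", how="left")

# Number of distinct podcasts per category.
per_category = merged.groupby("category")["podcast_id"].nunique()

print(merged)
print(per_category)
```

With the real files, the same merge call should apply to the pdf and cdf frames loaded above, assuming podcast_id has the same type in both.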

2.3 reviews.json

rdf = pd.read_json('reviews.json', lines=True)

# Inspect the columns of reviews.json
print(rdf.columns)
rdf

Run

Index(['podcast_id', 'title', 'content', 'rating', 'author_id', 'created_at'],
      dtype='object')
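
With roughly 5.6 million rows, reviews.json may not fit comfortably in memory on every machine. One possible workaround, sketched below on a tiny temporary file, is to pass chunksize to pandas.read_json() and filter each chunk as it is read. The file name reviews_sample.json, the sample rows, and the chunk size are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny JSON Lines file standing in for reviews.json (made-up rows).
rows = [
    {"podcast_id": "a1", "title": "Hello", "content": "Greetings from China", "rating": 5},
    {"podcast_id": "b2", "title": "Nice", "content": None, "rating": 4},
    {"podcast_id": "c3", "title": "好", "content": "很喜欢这档关于中国的节目", "rating": 5},
    {"podcast_id": "d4", "title": "Meh", "content": "Not relevant", "rating": 2},
]
path = os.path.join(tempfile.mkdtemp(), "reviews_sample.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# With lines=True and chunksize set, read_json yields DataFrames chunk by chunk
# instead of loading the whole file at once.
matches = []
for chunk in pd.read_json(path, lines=True, chunksize=2):
    mask = chunk["content"].fillna("").str.contains("China|中国")
    matches.append(chunk[mask])

china_reviews_sample = pd.concat(matches, ignore_index=True)
print(china_reviews_sample)
```

For the real file, a much larger chunksize (for example 100000) would be more practical; the right value depends on available memory.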

3. Experiments

3.1 Filtering podcast titles by keyword

Select the records in podcasts.json whose title contains China.

china_podcast_df = pdf[pdf['title'].fillna('').str.contains('China')]
china_podcast_df

# Print the titles of these 86 podcasts
print(china_podcast_df.title.values)

Run

['China Arts Podcast'
 'Made in China Podcast: International Business | Crowdfunding | Entrepreneurship'
 'Chinasource Recently Added Resources' 'TIC China Network' 'UNDP China'
 'Wellness in China' 'Party In China' 'Tails From China' 'Focus on China'
 'CEIBS China Knowledge' 'Bottled in China' 'Environment China'
 'China Money Podcast - Audio Episodes'
 'China Money Podcast - Video Episodes'
 'China Jedi Podcast: Expat Life | Chinese Culture | Business | Travel | China'
 'China Digital Marketing Podcast' 'Goodbye China Podcast'
......
 'Made In China - Noel Smith' 'Offices at China Hall Coworking Podcast'
 'Carved To China' 'Made in China' 'China Innovation Decoded'
 'Made in China' 'U.S.-China Dialogue Podcast' 'Falando de China'
 'China Business Minute' 'Linfen China' 'Young China Watchers'
 'China Business Review' 'Podbabes China' 'China Design Now (English)'
 'McKinsey Greater China' 'Governing China'
 'U.S./China Media Brief Program - Interviews' 'The China History Podcast'
 'RailsCasts China' 'Behind the Great Firewall of China Podcast'
 "China Now's Podcast" 'China: As History Is My Witness'
 'Safeguarding Dunhuang for China and the World' 'Biz China'
 'Chinaman Talks Sports' 'China in the World' 'The History of China'
 "Forbidden City: Inside the Court of China's Emperors"
 'NAFTA at Twenty: Trade, Transformation and the China Factor'
 'NAFTA at Twenty: Trade, Transformation and the China Factor (Audio Only)'
 'China and the Chinese by Herbert Allen Giles' 'China Doing Sweden'
 'China MSG' 'Yellow Star: China News' 'Made in China']
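
The output above includes titles such as 'Chinasource Recently Added Resources' and 'Chinaman Talks Sports', because str.contains('China') is a case-sensitive substring match. If whole-word or case-insensitive matching is wanted, a regular expression with word boundaries is one option. This is a sketch on a handful of sample titles, not the method used in the article.

```python
import re

import pandas as pd

titles = pd.Series([
    "China Arts Podcast",
    "Chinasource Recently Added Resources",
    "Chinaman Talks Sports",
    "Falando de China",
    "china in lower case",  # made-up title to show case handling
    None,                   # missing title
])

# Plain substring match, as in the article: also hits "Chinasource" and "Chinaman".
substring_mask = titles.str.contains("China", na=False)

# Whole-word, case-insensitive match: \b marks a word boundary.
word_mask = titles.str.contains(r"\bchina\b", flags=re.IGNORECASE, regex=True, na=False)

print(titles[substring_mask].tolist())
print(titles[word_mask].tolist())
```

Passing na=False makes missing titles count as non-matches, which serves the same purpose as the fillna('') call used in the article.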

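3.2 Filtering review titles by keyword

Select the records in reviews.json whose title contains China or 中国. Note that a podcast's title in podcasts.json is fixed, whereas the title field in reviews.json belongs to an individual review and therefore differs from record to record.

# Select the records in reviews.json whose title contains China or 中国
china_title_df = rdf[rdf['title'].fillna('').str.contains('China|中国')]
china_title_df

# Print the matching titles (the output below lists titles, not review bodies)
print(china_title_df.title.values)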

Run

["What's a China?" 'Thanks Justin - from China'
 'American Working in China Coffee Industry' 'Babybee in China'
 'Listening From China!!' 'Right on China.' 'Excellent China Series!'
 'China Trade War episode was fantastic'
 'Really enjoyed the China / Tariff discussion' 'China Review'
 'Beautiful videos of China!' 'Learn about The Real China business'
 'Doing business in China? Listen to this!' 'China'
 "Insightful look into China's growing influence"
 'Heavy hitters share their views on China' 'Huawei (or China)'
 'Emergency China podcast was unreal' 'China Episode' 'China'
 '矮大紧老师的确是现代中国文化圈里面的高山晓辉里的奇松' 'Love the China rant' '中国好'
 'Band in China' '关于中国生活有趣的观点' 'Deep and personal angle to look at China'
 'Saying hi from China' '终于有一档中国记者做的播客' 'China’s’  Detention Camps'
 'China Tech Insider' 'She-G-string.    king of China'
......
......
 'Required listening to keep up with contemporary China'
 'Most antiChina guests and content' 'Fantastic China-centric podcast'
 'The best Podcast on China-related topics' 'Big trouble in little China'
 '中国最好的游戏广播。' '中国第一家做游戏广播的！！' 'The best game radio in China!'
 'Best Podcast on China’s History'
 'Great China Insights and interview topics'
 'Great new Content on China and Sede Vacante' '没有中国特色'
 'Into China Marketing?  This is the Podcast!'
 '“You can’t be angry at China for the lab leak...”'
 'China and the risk of nuclear conflict' 'The China threat'
 'The second-best China-Africa podcast there is!'
 'Review of Mosaic of China'
 'Really interesting look at the people who live in China'
 'A good way to peek inside of China!' 'How can we be more like China?'
 'A must listen for China policitics nerds!'
 'Great way to keep up on the EU-China relationship'
 'Interesting and informative podcast on China'
 'SCTV from the South China Sea' 'China and Omicron' 'Strangers in China'
 'China seems very scary' 'China Lockdown'
 'I travel to China regularly just to listen'
 'Best American News I Can Find in China!!!!']
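
To see how many of the matching titles come from each keyword, the two patterns can be counted separately. The sketch below uses a few titles copied from the output above plus made-up non-matching rows; it is an illustration, not part of the original analysis.

```python
import pandas as pd

review_titles = pd.Series([
    "Thanks Justin - from China",
    "中国好",
    "The best game radio in China!",
    "Great show",  # made-up non-matching title
    None,          # missing title
])

keywords = ["China", "中国"]

# regex=False treats each keyword as a literal string rather than a pattern.
counts = {
    kw: int(review_titles.str.contains(kw, regex=False, na=False).sum())
    for kw in keywords
}
print(counts)
```

On the real data, review_titles would be rdf['title'].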

3.3 Filtering review content by keyword

# Select the records in reviews.json whose content contains China or 中国
china_reviews_df = rdf[rdf['content'].fillna('').str.contains('China|中国')]
china_reviews_df
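
Once the matching reviews are selected, a natural next step is to summarise them over time. The sketch below counts reviews and averages ratings per year on made-up rows. It assumes created_at holds timestamps that pandas.to_datetime() can parse; the actual format in reviews.json is not shown in this article.

```python
import pandas as pd

# Made-up rows standing in for china_reviews_df.
china_reviews_df = pd.DataFrame({
    "podcast_id": ["a1", "a1", "b2", "b2"],
    "rating": [5, 4, 1, 5],
    "created_at": [
        "2019-03-01T10:00:00",
        "2019-11-20T08:30:00",
        "2021-06-05T12:00:00",
        "2022-01-15T09:00:00",
    ],
})

# Parse timestamps; errors="coerce" turns unparseable values into NaT.
created = pd.to_datetime(china_reviews_df["created_at"], errors="coerce")
china_reviews_df["year"] = created.dt.year

# Review count and mean rating per year.
yearly = china_reviews_df.groupby("year")["rating"].agg(["count", "mean"])
print(yearly)
```

The same groupby could use podcast_id instead of year to rank podcasts by the number of matching reviews.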

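4. How to obtain the dataset

The dataset costs 200 RMB. Add WeChat 372335839 with a note in the format 【姓名-学校-专业-博客】 (name, school, major, 博客).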

https://textdata.cn/blog/2024-06-03-podcasts-dataset/
